Cell Systems
Elsevier BV
Preprints posted in the last 90 days, ranked by how well they match Cell Systems's content profile, based on 167 papers previously published here. The average preprint has a 0.54% match score for this journal, so anything above that is already an above-average fit.
Lopez-Malo, M.; Maerkl, S. J.
Transcription factors (TFs) regulate gene expression by binding cis-regulatory DNA elements, yet how trans-regulatory characteristics such as TF affinity, concentration, and localization interact with cis-regulatory elements remains largely unclear. We systematically analyzed TF affinity mutants across abundance and localization states and found that promoter binding-site strength most readily modulated expression levels, followed by TF localization and concentration, while affinity variations were mainly buffered. We further uncover performance trade-offs between TF abundance, localization, and affinity. Together, these results reveal how trans and cis factors collectively shape gene-regulatory output.
Wu, J.; Dong, L.; Jia, N.; Li, L.; Zhang, H.
Direct Preference Optimization (DPO) has emerged as a powerful paradigm for aligning generative models, yet its temporal optimization dynamics in the discrete diffusion space of proteins remain poorly understood. Existing approaches often assume that maintaining structural integrity while optimizing physicochemical properties requires simultaneous, tightly coupled reinforcement learning constraints. In this work, we challenge this assumption by uncovering a fundamental temporal decoupling between structural and functional alignment. Using antibody design as a testbed, extensive trajectory analysis reveals two distinct regimes: (1) Instant Structural Alignment, where the strong generative prior of discrete diffusion rapidly eliminates structural hallucinations via denoising within the first few epochs; and (2) Slow Property Adaptation, where physicochemical attributes improve gradually over a prolonged optimization window. We further identify a critical transition point around Epoch 50, which empirically defines a Pareto-optimal boundary between property improvement and structural stability. Beyond this point, continued optimization induces a sharp phase transition into a Structural Collapse regime. To isolate the physical driver underlying this collapse, we introduce a counterfactual preference experiment targeting negative charge. We observe a striking symmetrical collapse: while hydrophilicity optimization induces a Poly-Arginine (+) degeneration, negative charge optimization drives a Poly-Aspartate (-) degeneration. Despite opposite physicochemical trajectories, including extreme shifts in isoelectric point (> 11 vs. < 4.5), both regimes converge to the same structural failure. This symmetry demonstrates that generic Coulombic repulsion, rather than residue-specific bias, constitutes the fundamental physical constraint being violated. 
Our findings reveal that discrete diffusion models possess strong intrinsic structural robustness, enabling minimalist alignment strategies provided optimization halts before this physical boundary. More broadly, this work offers a mechanistic warning against unchecked reward optimization in biological generation, illustrating a concrete manifestation of Goodhart's Law in protein design. Code and data are available at https://github.com/Wu-Junqi/DPO-Protein-Diffusion.
Arun, K. M.; Scher, Y.; Zhang, Y. D.; Büschel, I.; Kuznets-Speck, B.; Marr, C.; Goyal, Y.
Gene regulatory networks (GRNs) underlie maintenance of cellular phenotypes and responses to stimuli. Modern single-cell profiling methods offer high-throughput datasets to infer GRNs en masse but do not capture dynamic information. Cell-state heterogeneity further confounds correlation-based inference approaches. We addressed these challenges with TwINFER, a conceptual framework leveraging information from recently divided sister cells, or "twins," identifiable via recently developed barcoding techniques. We show that twin information discriminates regulatory from non-regulatory correlations and resolves interaction direction and type (activation/repression). We performed a diverse set of simulations, covering common network motifs and large-scale networks, where TwINFER outperforms state-of-the-art inference capabilities. Crucially, TwINFER resolved the commonly observed false positives in fan-out and feed-forward loop motifs where most methods perform poorly. Lastly, we applied TwINFER to a lineage-barcoded hematopoiesis dataset which refined the network inference, flagged multi-state genes, and determined causal relations. Our work exploits cellular twins as untapped information, readily complementing existing inference approaches.
Hendrychova, V.; Brinda, K.
One important question in bacterial genomics is how to represent and search modern million-genome collections at scale. Phylogenetic compression effectively addresses this by guiding compression and search via evolutionary history, and many related methods similarly rely on tree- and ordering-based heuristics that leverage the same underlying phylogenetic signal. Yet, the mathematical principles underlying phylogenetic compression remain little understood. Here, we introduce the first formal framework to model phylogenetic compression mechanisms. We study genome collections represented as RLE-compressed SNP, k-mer, unitig, and uniq-row matrices and formulate compression as an optimization problem over genome orderings. We prove that while the problem is NP-hard for arbitrary data, for genomes following the Infinite Sites Model it becomes optimally solvable in polynomial time via Neighbor Joining (NJ). Finally, we experimentally validate the model's predictions with real bacterial datasets using an exact Traveling Salesperson Problem (TSP). We demonstrate that, despite numerous simplifying assumptions, NJ orderings achieve near-optimal compression across dataset types, representations, and k-mer ranges. Altogether, these results explain the mathematical principles underlying the efficacy of phylogenetic compression and, more generally, the success of tree-based compression and indexing heuristics across bacterial genomics.
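To make the ordering objective in the abstract above concrete, here is a minimal sketch, not the authors' implementation. It assumes a toy binary SNP matrix (hypothetical data) and measures compressed size as the total number of run-length-encoding runs across columns. The names `rle_runs` and `total_runs` are illustrative only.

```python
from itertools import permutations

def rle_runs(column):
    """Number of runs in a sequence, i.e. the length of its run-length encoding."""
    return sum(1 for i, v in enumerate(column) if i == 0 or v != column[i - 1])

def total_runs(matrix, order):
    """Total RLE runs over all columns when genomes (rows) are listed in `order`."""
    n_cols = len(matrix[0])
    return sum(rle_runs([matrix[g][c] for g in order]) for c in range(n_cols))

# Toy SNP matrix (hypothetical): rows are genomes, columns are SNP sites.
# Each SNP arose once on the tree ((0,1),(2,3)), as under an infinite-sites assumption.
snps = [
    [1, 1, 0, 0],  # genome 0
    [1, 0, 0, 0],  # genome 1
    [0, 0, 1, 1],  # genome 2
    [0, 0, 1, 0],  # genome 3
]

tree_order = [0, 1, 3, 2]   # a leaf order of the tree: clades stay contiguous
mixed_order = [0, 2, 1, 3]  # clades interleaved

# Brute-force optimum over all orderings (feasible only for tiny inputs).
best = min(total_runs(snps, list(p)) for p in permutations(range(len(snps))))

print(total_runs(snps, tree_order), total_runs(snps, mixed_order), best)
```

In this toy case the tree-consistent order reaches the brute-force optimum, while the interleaved order needs more runs. The brute-force search stands in for an exact solver and does not scale beyond a handful of genomes.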
Marken, J. P.; Prator, M. L.; Hay, B. A.; Murray, R. M.
Although microbes in natural environments spend most of their time in growth arrest, we understand little about how this physiological state affects the performance of engineered genetic circuits. Here, we measure repression curves from a library of genetic NOT gates at single-cell resolution in Escherichia coli under both active growth and growth arrest to systematically investigate how growth arrest affects circuit behavior. We find that the impact of growth arrest on circuit performance is almost entirely dominated by a single effect: a >100-fold reduction in unrepressed expression levels. Growth arrest caused gene expression noise to increase moderately and had only minimal impacts on the sensitivity and sharpness of the repression curves. Our work shows that conventional genetic circuit design paradigms are currently insufficient to develop circuits that can function properly under growth arrest, but also that addressing the reduction in just a single performance parameter would be sufficient to resolve this problem. This work expands our understanding of bacterial gene regulation under growth arrest and lays the groundwork for new design paradigms that will be essential in ensuring the safe and reliable performance of synthetic biology systems in real-world environments.
Chou, J.; Lin, C.; Chen, J.; Hart, T.
Genetic interactions (GI) reveal functional relationships for understanding gene function and identifying candidate therapeutic vulnerabilities. Combinatorial CRISPR technologies enable genome-scale GI mapping in mammalian cells, but existing analytical methods lack systematic validation against ground truths. We introduce GRAPE (Genetic interaction Regression Analysis of Pairwise Effects), a computational framework that identifies GIs from pooled CRISPR screens by using linear regression to estimate single-gene phenotypes and detecting deviations from expected double-knockout effects. To enable rigorous benchmarking, we developed Synulator, a pipeline that simulates realistic CRISPR screen data with defined synthetic lethal interactions while preserving experimental noise profiles. In simulated screens, GRAPE achieves greater precision and recall compared to existing methods, particularly for interactions with weaker effect sizes. Applying GRAPE to published combinatorial screens across cell lines and CRISPR platforms demonstrates concordance with original findings while identifying additional high-confidence interactions. GRAPE provides a robust, versatile tool for GI mapping, advancing functional genomics and the systematic discovery of synthetic lethal targets in cancer. Teaser: A regression-based framework and simulations enable accurate detection of GIs from combinatorial CRISPR screens.
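The regression idea in the abstract above can be sketched as follows. This is not GRAPE itself: it assumes a simple additive model in which each double-knockout phenotype is the sum of two single-gene effects, a noise-free toy screen in which every gene pair is observed once, and hypothetical gene names and effect sizes. For that balanced design the least-squares fit has a closed form, so no linear-algebra library is needed.

```python
from itertools import combinations

def fit_single_gene_effects(pair_phenotypes, genes):
    """Least-squares fit of y_ij ~ a_i + a_j when every gene pair is observed once.

    For this balanced all-pairs design the normal equations reduce to
    a_i = (S_i - Y / (n - 1)) / (n - 2), where S_i sums the phenotypes of all
    pairs containing gene i and Y sums all phenotypes (requires n >= 3 genes).
    """
    n = len(genes)
    total = sum(pair_phenotypes.values())
    row_sum = {g: 0.0 for g in genes}
    for (g1, g2), y in pair_phenotypes.items():
        row_sum[g1] += y
        row_sum[g2] += y
    return {g: (row_sum[g] - total / (n - 1)) / (n - 2) for g in genes}

def interaction_scores(pair_phenotypes, genes):
    """Residual of each pair from the additive expectation (negative = worse than expected)."""
    effects = fit_single_gene_effects(pair_phenotypes, genes)
    return {
        pair: y - effects[pair[0]] - effects[pair[1]]
        for pair, y in pair_phenotypes.items()
    }

# Hypothetical toy screen: six genes with made-up single-gene effects.
genes = ["A", "B", "C", "D", "E", "F"]
true_effect = {"A": -0.2, "B": -0.1, "C": -0.5, "D": 0.0, "E": -0.3, "F": -0.4}
phenotypes = {
    (g1, g2): true_effect[g1] + true_effect[g2] for g1, g2 in combinations(genes, 2)
}
phenotypes[("A", "B")] -= 2.0  # planted synthetic lethal interaction

scores = interaction_scores(phenotypes, genes)
top_hit = min(scores, key=scores.get)
print(top_hit, round(scores[top_hit], 3))
```

The planted pair is recovered as the most negative residual, but its score is smaller in magnitude than the planted effect because the fitted single-gene terms absorb part of the interaction. This is one reason benchmarking against simulated ground truth, as the abstract describes, is informative.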
Boileau, R. M.; Golas, S. M.; Ma, Q.; Jiang, B.; a, A.; Jia, M.; Ilieva, N.; Baydush, A.; Fu, H.; Chory, E. J.
Natural evolution is high-dimensional; organisms adapt to many pressures at once, across substrates, environments, and genetic backgrounds. Yet most directed evolution methods flatten this landscape to a single selection axis, hiding tradeoffs and limiting what can be learned. Phage-assisted continuous evolution (PACE) is uniquely suited for multivariate selection because horizontal gene transfer couples genotype to propagation and allows the same phage lineage to traverse different selection environments. In practice, implementing this at scale has been prohibitive because each selection demands its own host culture, and every culture must be held for days to weeks within a narrow, infectable density window using continuously responsive bioreactors. In this work, TurboPRANCE is presented as an open-source, queueable robotic platform that integrates ~200 independently controlled turbidostats with 96 parallel PACE lagoons under closed-loop control. Each turbidostat operates as a fully separate unit that can be equilibrated and initiated on its own schedule, enabling asynchronous starts and sustained operation without intervention. Automated media formulation, programmable dosing, on-deck sterilization, and adaptive scheduling coordinate growth control with the changing needs of the robotic workflow, dynamically adjusting dilution and transfer timing around formulation, sampling, and handling steps to keep each culture at consistent infectable densities despite unpredictable method demands. Cultures can be multiplexed and titrated into lagoons at defined ratios, swapped in and out on a schedule, or kept fully separate across experiments, creating a combinatorial space of selection pressures and programs that is effectively unbounded.
Additionally, to enable high-throughput evolutionary tracking that scales with TurboPRANCE, Nanopore long-read sequencing was combined with DeepVariant, a deep learning-based variant caller, enabling population-level tracking of evolving variants. The result is a system that generates high-resolution time-resolvable evolutionary trajectories and large parallel datasets spanning diverse selection regimes, yielding dense, multivariate training data to map and engineer complex fitness landscapes at scale.
Figure 1: Turbidostat, Phage, and Robotics-Assisted Near Continuous Evolution (TurboPRANCE). In phage-assisted continuous evolution (PACE), biomolecular activity is coupled to pIII expression, linking function to phage propagation. By altering the host strain, distinct selection pressures can be imposed on the same evolving phage population. In TurboPRANCE, ~200 selection programs can vary over time, including periodic "drift" (mutagenesis), alternation between pressures, or rotational reassignment of host sources, enabling a combinatorial space of selection pressures.
Verkuijl, S. A. N.; Ivimey-Cook, E. R.; Liu, B.; Bonsall, M. B.; Leftwich, P. T.; Windbichler, N.
Homing gene drives can bias their inheritance above Mendelian expectations, but reported outcomes vary widely. We compiled a cross-species dataset of nearly one million scored progeny from 42 publications reporting CRISPR/Cas9 endonuclease-based gene drives in 10 model, pest, and vector species. Using multilevel meta-analytic models, we evaluate biological, cross, and transgene design factors as predictors of biased drive inheritance. Species is the strongest predictor, but most heterogeneity remains unexplained; design features each explain a modest fraction of the remaining variation, with large construct-to-construct differences pointing to the full combination of design choices rather than any single factor. Nuclease expression timing, the most common optimization target, has limited predictive value after accounting for correlated factors, and predictions do not transfer well between species. Maternal nuclease deposition has a marginal effect on drive inheritance but dramatically increases somatic phenotype rates in offspring, revealing a tissue-of-action rather than repair-outcome effect. An interactive web tool enables community analysis of this dataset, which will guide the design of more efficient gene drives for genetic vector control, invasive species management and other applications. https://sverkuijl.shinyapps.io/GeneDrive/
Ibarraran, S.; Chennakesavalu, S.; Hu, F.; Rotskoff, G. M.
Directed evolution is a powerful and widely used technique for protein engineering, and reducing the cost of iterated experimental observations has become a major priority for practitioners. A number of recent efforts to use machine-learning-based predictors to improve sequence selection have led to remarkable improvements in efficiency, but the sparse data at each experimental iteration restricts these approaches to extremely simple models. Adapting large-scale pre-trained protein language models using experimental data offers an alternative that we show productively leverages the strong inductive biases of the natural distribution of protein sequences to navigate high-dimensional, combinatorially large fitness landscapes. Our approach uses a general-purpose "post-training" algorithm grounded in statistical physics that employs quantitative experimental rankings to directly produce a sampler for diverse, high fitness sequences with fewer data points than competing methods. The resulting adapted protein language model can itself be studied and interpreted, shedding further light on the biophysical characteristics of highly fit sequences and their properties.
Feist, A. M.; Woo, S.; Lim, H. G.; Norton-Baker, B.; Lind, T. M.; Gladden, N. E.; Chen, Y.; Eng, T.; Johnson, C. W.; Mukhopadhyay, A.; Petzold, C. J.; Guss, A. M.; Beckham, G. T.
Efficient co-utilization of hexose and pentose sugars from lignocellulose is essential for microbial production of bio-based chemicals, yet engineered non-native catabolic pathways can be suboptimal and evolutionarily unstable in complex resource environments. We used a Pseudomonas putida strain, previously engineered to catabolize xylose and arabinose, to examine how resource abundance, temporal availability, and sub-culturing criteria shape evolutionary outcomes. Using an automated adaptive laboratory evolution (ALE) platform, we evolved the strain under static conditions with single selection pressures and dynamic regimes that imposed selection pressures on multiple sugars. These environments drove divergence between catabolic specialists and generalists. While selection regimes with weak or absent selection for xylose frequently resulted in loss of xylose catabolism, evolution under carbon-limited, mixed-sugar environments promoted stable retention and coordinated optimization of multiple catabolic pathways, increasing total sugar consumption in mixed-sugar conditions. Genomic, proteomic, and biochemical analyses showed that evolutionary stability was determined by pathway-specific fitness costs, leading to either pathway loss or cost-reducing refinement, depending on selection strength. An isolated generalist clone also exhibited improved indigoidine production from mixed sugars when compared to the parental strain. Together, these findings link resource dynamics to fitness landscapes that determine catabolic specialization, generalization, evolutionary trade-offs, and applicability to bioconversion.
Thiel, M.; Cunningham, A.; Barnes, C. P.
Reinforcement learning has driven the mass adoption of large language models by unlocking unexpected capabilities, yet this approach remains largely underexplored for generative DNA models. We investigate whether similar post-training techniques can induce emergent biological realism in DNA language models, using plasmid generation as a testbed due to plasmids' relative simplicity, well-characterized functional constraints, and ubiquity in biotechnology. Using Group Relative Policy Optimization with a reward function based on constraints from engineered biology, our model achieves a 77% quality control pass rate compared to 5% for the pretrained baseline. Remarkably, beyond explicitly optimized features, the model exhibits surprising biological parallels: generated sequences match natural plasmids in thermodynamic stability, codon usage patterns, and ORF length distributions, properties not explicitly optimized in the reward function. These results suggest that RL post-training can steer DNA language models toward biologically coherent regions of sequence space, analogous to how such techniques unlock unexpected capabilities in natural language models, particularly in verifiable domains.
Heidari, M.; Karimpour, M.; Srivatsa, S.; Montazeri, H.
Predicting cellular responses to genetic and chemical perturbations remains a central challenge in single-cell biology and a key step toward building in silico virtual cells. The rapid growth of perturbation datasets and advances in deep-learning models have raised expectations for accurate and generalizable prediction. We show that these expectations are overly optimistic, largely due to the failure modes of existing evaluation metrics. In this study, using cross-splitting, controlled noise experiments, and synthetic data, we systematically evaluate both prediction models and evaluation metrics. We demonstrate that widely used metrics, including correlation-based measures and common distributional distances, are strongly influenced by scale, sparsity, and dimensionality, often misrepresenting model performance. In particular, the Wasserstein distance fails in high-dimensional gene expression spaces under variance scaling, while the Energy distance can overlook disruptions in gene-gene dependencies. Our analyses further reveal that complex deep learning models often underperform simple baselines and remain far from empirical performance bounds across multiple chemical perturbation datasets. Together, our framework exposes critical pitfalls, establishes robust evaluation guidelines, and provides a foundation for trustworthy benchmarking toward reliable virtual-cell models.
Simensen, V.; Almaas, E.
Metal-binding proteins account for nearly half of the characterized proteome, and they rely on metal-binding sites (MBSs) as critical determinants of their structural stability and biological function. However, methods for comparing their local binding environments lag behind those for whole-structure alignment. Here, we represent MBSs as atomic point clouds surrounding bound metal ligands and align them with a fine-tuned iterative closest point algorithm. Applying this framework to a redundancy-reduced collection of MBSs derived from all metalloproteins in the Protein Data Bank (PDB), we perform pairwise alignments across 23,342 sites to construct a similarity network of metal-binding environments. The resulting network topology recapitulates metal coordination chemistry and enzyme function: links are strongly enriched within metal types and across shared EC subclasses. Conserved metalloenzyme families form cohesive subnetworks; for example, the binuclear ureohydrolase domain appears as two tightly connected components that also capture atypical members such as the dinickel metformin hydrolase. We observe only a moderate global association between protein sequence and MBS geometry, yet many network links connect near-identical binding-site architectures across proteins with low sequence identity, consistent with either divergent evolution with local MBS conservation or candidate cases of molecular convergent evolution. Integrating network proximity with structural evidence of drug binding identifies drugs with enriched connectivity among their targets and predicts 528 drug-off-target combinations across 88 drugs and 151 human proteins, recovering both known off-targets (e.g., ADAM/ADAMTS for matrix metalloproteinase inhibitors) and proposing novel ones. The MBS network thus provides a scalable resource for probing metalloprotein evolution, functional convergence, and the structural basis of drug cross-reactivity. 
Author summary: We study how metals shape protein structure and function by comparing metal-binding sites (MBSs) rather than whole proteins. We represent each MBS as a point cloud of atoms surrounding the bound metal and align 23,342 sites from the Protein Data Bank (PDB) with a fine-tuned iterative closest point algorithm. This yields a similarity network whose links mirror metal coordination chemistry and enzymatic roles: sites binding the same metal or sharing enzyme classes cluster together, and conserved metalloenzyme families (e.g., binuclear ureohydrolases) form tight subnetworks that also capture atypical members such as a dinickel metformin hydrolase. Because highly similar MBS geometries often link proteins with low sequence identity, the MBS network highlights candidates consistent with either divergent evolution with locally conserved MBS architecture or convergent evolution toward similar coordination geometries in otherwise unrelated protein contexts. Overlaying known drug-binding sites lets us flag drugs whose targets are tightly connected and propose plausible off-targets, recovering known matrix metalloproteinase off-targets and suggesting new ones. Our approach offers a scalable map of metalloprotein relationships useful for studying evolution and anticipating drug cross-reactivity.
Ward, M.; Richardson, M.; Lin, H.; Stamm, M.; Wright, K.; Kim, A.; Bicknell, A.; Ahmed, N.; Jones, A.; Davis, J. W.; Metkar, M.
mRNA medicines hold great promise, but designing sequences with high translation efficiency, robust in-solution stability, and manufacturability remains a major challenge due to the vast combinatorial space of synonymous coding sequences. Computational approaches such as mRNA folding algorithms have emerged as powerful tools by co-optimizing for in-solution stability and translation efficiency, yet current methods face important limitations. Here, we present "mRNAfold", an improved mRNA folding algorithm and software package that addresses these gaps by enabling efficient exploration of diverse near-optimal solutions, incorporating untranslated regions (UTRs), supporting parallel execution, and offering tunable control over local structural features across the mRNA. Thermodynamically optimized mRNAs from mRNAfold were more stable (approximately 2-fold) in-solution than those generated by simple GC maximization for the same encoded protein. In addition, mRNAs designed to vary local structure near the start codon while maintaining consistent structure and codon optimality elsewhere showed a complex relationship between local structure near the start codon and protein production in cells. We observed no impact of structure in the start codon region for a set of mRNAs with high codon optimality, but it did impact protein production for a set of mRNAs with lower codon optimality. Together, these results underscore the potential of structure-aware, multi-objective design to improve mRNA medicines and offer a framework for exploring how sequence, structure, and expression are interrelated.
Khan, S. R.; Sashittal, P.
Tumors comprise subpopulations of cells that harbor distinct collections of somatic mutations, ranging from single-nucleotide variants (SNVs) to large-scale copy-number aberrations (CNAs). Single-cell whole-genome sequencing (scWGS) enables direct measurement of these mutations; however, inferring tumor phylogenies from scWGS data remains challenging due to ultra-low coverage (~0.05x). There may be multiple ways of imputing missing information in the data, leading to distinct tumor phylogenies that are equally well supported by the data. Existing methods produce a single phylogeny and overlook this uncertainty in reconstructing evolutionary histories from sparse scWGS data. We present SCOPE, a novel algorithmic framework that characterizes the space of tumor phylogenies consistent with scWGS data under a copy-number constrained version of the perfect phylogeny model. Our approach relies on estimating the cell fraction of each mutation, i.e. the proportion of cells within each copy-number cluster that carry the mutation. We derive the necessary and sufficient conditions these fractions must satisfy to admit a copy-number constrained perfect phylogeny. This yields a complete combinatorial description of all tumor phylogenies that are supported by the data under our model. We prove that identifying the largest subset of mutations whose cell fractions satisfy the model constraints, given noisy measurements of cell fractions, is NP-hard. On simulated data, SCOPE outperforms existing methods in accuracy and runs faster, particularly on the larger simulations. On scWGS data from a patient-derived ovarian cancer cell line, SCOPE infers a more resolved phylogeny with stronger statistical support compared to existing methods. Using SCOPE to analyze a larger dataset of 4 triple negative breast cancer (TNBC) and 8 high-grade serous ovarian cancer (HGSOC) samples, we show that several samples admit multiple phylogenies.
We further find that the number of admissible phylogenies increases with lower sequencing coverage and is negatively correlated with the number of copy-number clusters and the number of distinct loss of heterozygosity (LOH) events in the clusters, highlighting how data quality and evolutionary constraints jointly shape uncertainty in tumor phylogeny reconstruction. By providing a principled framework for exploring and quantifying phylogenetic uncertainty, SCOPE establishes a new foundation for robust inference of tumor evolution from scWGS data. Code availability: Software is available at https://github.com/sashittal-group/SCOPE
Bixby, E.; Brunner, G.; Danciu, D.; Dela Rosa, R.; Deutschmann, N.; Ferragu, C.; Geiger, F.; Holberg, C.; Kidger, P.; Lindoulsi, A.; Lutz, N.; McColgan, T.; Milius, S.; Shah, J.; Vandeloo, M.; Vidas, P.; Ziegler, J. D.; van Rossum, H.; van der Vorm, D.; Baldi, N.; IJSpeert, C.; Monza, E.; Schriek, A.
Lead optimization remains the longest and most expensive step in pre-clinical drug discovery, typically consuming 12-36 months whilst costing $5M-$15M per candidate. We introduce CRADLE-1, an automated framework for protein engineering. While CRADLE-1 supports the full process of drug discovery and industrial protein engineering pipelines, including hit identification and de novo binder design, this work focuses on its application to multi-property lead optimization across protein modalities (VHHs, scFvs, IgGs, peptides, enzymes, CRISPR systems, vaccines). We show it is 4-7x faster than rational design, as measured by the number of wet lab rounds required. We provide in-vitro validation across all of the above modalities, typically optimizing multiple properties simultaneously (single and polyspecific binding down to picomolar, activity, thermostability, ...). Technically, CRADLE-1 starts with pre-trained foundation protein language models (PLMs), which are fine-tuned in unsupervised fashion on evolutionary neighborhoods, in supervised fashion using lab-in-the-loop data, and then deployed in a multi-model workflow. Of additional interest, we find that (a) the end-to-end system may be run in automated fashion; (b) wet lab data may be consumed in black box fashion without knowledge of the underlying biochemical mechanisms; (c) structural data may largely be superseded by sequence-function pairs.
Richter, T.; Zimmermann, E.; Hall, J.; Theis, F. J.; Raghavan, S.; Winter, P. S.; Amini, A. P.; Crawford, L.
The vision of a "virtual cell"--a computational model that simulates biological function across modalities and scales--has become a defining goal in computational biology. While powerful unimodal foundation models exist, the lack of large-scale paired data prohibits the joint training of multimodal approaches. This scarcity favors compositional foundation models (CFMs): architectures that fuse frozen unimodal experts via a learned interface. However, it remains unclear when this multimodal fusion adds task-relevant information beyond the strongest unimodal representation and when it merely aggregates redundant signal. Here, we introduce the Synergistic Information Score (SIS), a metric grounded in partial information decomposition (PID), that quantifies the information gain achievable only through cross-modal interactions. Extending theoretical results from self-supervised learning, we show that standard alignment-based fusion objectives on frozen encoders inherently collapse to detecting linear redundancies, limiting their ability to capture nonlinear synergistic states. This distinction is directly relevant for tasks aiming to link tissue morphology and gene expression. Benchmarking ten fusion methods on spatial transcriptomics datasets, we use SIS to demonstrate that tasks dominated by linear redundancies are sufficiently served by unimodal baselines, whereas complex niche definitions benefit from synergy-aware integration objectives that enable cross-modal interactions beyond linear alignment. Finally, we perform a scaling analysis which highlights that fine-tuning a dominant unimodal expert is the most sample-efficient path for standard tasks, suggesting that the benefits of multimodal frameworks only emerge when tasks depend on information distributed across modalities. 
Together, these results establish that building towards a virtual cell will require a fundamental shift from alignment objectives that emphasize shared structure to synergy-maximizing integration that preserves and exploits complementary cross-modal signal.
Savino, A.; Oikonomou, A.; De Lucia, R. R.; Grau, M. L.; McCarten, K.; Najgebauer, H.; Perron, U.; Azzolin, L.; Livanova, A.; Cremaschi, P.; Lopez-Bigas, N.; Sottoriva, A.; Iorio, F.
Identifying cancer driver genes and mutations remains a cornerstone of cancer research and a prerequisite for developing effective targeted therapies. While current approaches have successfully uncovered recurrent oncogenic alterations, they often miss rare or context-specific events, leaving large segments of the mutational landscape of human cancers functionally uncharacterised. We developed a CRISPR-enhanced analytical framework that systematically identifies Dependency-Associated Mutations (DAMs): somatic variants linked to increased viability dependency on their hosting gene in cancer cells. To this end, we designed a rank-based metric applicable even to singleton variants and analysed large-scale functional genomics data from over a thousand cancer cell lines. We discovered more than 2,000 DAMs, including more than 1,000 in genes not previously reported as cancer drivers. These unreported DAM-bearing genes reinforce canonical oncogenic pathways, revealing overlooked but functionally coherent nodes. By integrating drug response profiles, patient mutation data, and functional impact predictions, we distilled these findings into a refined set of hundreds of high-priority DAMs: variants that are not only functionally impactful and recurrent in tumours, but also encode druggable proteins and exhibit strong potential for clinical translation. Comparative analyses revealed significant overlap with an independent study, underscoring the robustness and reproducibility of our approach. All results are available through the CRISPR VUS Portal (https://vus-portal.fht.org), an interactive resource for exploring mutation-dependency relationships across cancer types. Our findings expand the functional and therapeutic landscape of cancer genomics, providing a scalable framework to interpret non-recurrent variants and systematically uncover novel cancer vulnerabilities.
By linking mutational profiles to gene essentiality and pharmacological sensitivity, our work extends the reach of precision oncology beyond canonical cancer drivers.
Nikitin, K.; Gursoy, G.
Polygenic Risk Scores (PRSs) estimate the likelihood that individuals will develop complex diseases based on their genetic variations. While their use in clinical practice and direct-to-consumer genetic testing is growing, the privacy implications of publicly sharing PRS values are often underestimated. In this work, we demonstrate that PRSs can be exploited to recover genotypes and to de-anonymize individuals. We describe how to reconstruct a portion of an individual's genome from a single PRS value by using dynamic programming and population-based likelihood estimation, which we experimentally demonstrate on PRS panels of up to 50 variants. We highlight the risks of combining multiple, even larger-panel PRSs to improve genotype-recovery accuracy, which can lead to the re-identification of individuals or their relatives in genomic databases or to the prediction of additional health risks not originally associated with the disclosed PRSs. We then develop an analytical framework to assess the privacy risk of releasing individual PRS values and provide a potential solution for sharing PRS models without decreasing their utility. Our tool and instructions to reproduce our calculations can be found at https://github.com/G2Lab/prs-privacy.
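To make the idea of recovering genotypes from a single score more concrete, the following is a minimal sketch and not the authors' implementation (their tool is at the repository linked above). It assumes a PRS of the form sum of weight times allele count (0, 1, or 2), effect weights already scaled to integers, and a Hardy-Weinberg prior built from allele frequencies. The function names `genotype_log_prior` and `recover_genotypes` are hypothetical.

```python
from math import log


def genotype_log_prior(g, freq):
    """Log-probability of allele count g (0, 1 or 2) under Hardy-Weinberg,
    given effect-allele frequency freq (assumed strictly between 0 and 1)."""
    probs = ((1 - freq) ** 2, 2 * freq * (1 - freq), freq ** 2)
    return log(probs[g])


def recover_genotypes(weights, freqs, target):
    """Return the most likely genotype tuple whose score equals target.

    weights: integer effect weights (assumed pre-scaled to integers)
    freqs:   effect-allele frequencies, one per variant
    target:  observed integer score
    Returns None if no genotype combination reaches the target.
    """
    # best[s] = (log-likelihood, genotypes) of the most likely partial
    # assignment whose partial score is s (subset-sum style DP table).
    best = {0: (0.0, ())}
    for w, f in zip(weights, freqs):
        nxt = {}
        for s, (ll, gs) in best.items():
            for g in (0, 1, 2):
                s2 = s + g * w
                cand = (ll + genotype_log_prior(g, f), gs + (g,))
                if s2 not in nxt or cand[0] > nxt[s2][0]:
                    nxt[s2] = cand
        best = nxt
    hit = best.get(target)
    return hit[1] if hit is not None else None


if __name__ == "__main__":
    # Toy panel of three variants; the score 13 is only reachable as 3*1 + 5*2.
    print(recover_genotypes([3, 5, 11], [0.2, 0.5, 0.1], 13))
```

The table keeps one entry per reachable partial score, so its size is bounded by the range of possible scores rather than by the 3^n genotype combinations. This works only because the weights are integers here; real-valued weights would need rounding to a fixed precision first, which is an additional assumption of this sketch.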
Camacho-Mateu, J.; Burgio, G.; Quiros-Rodriguez, I.; D Fernandez-de-Bobadilla, M.; Sanchez, A.
The function of microbial communities is often dominated by additive and pairwise interactions, raising the question of whether this reflects intrinsic biological simplicity or fundamental limits of detection. Here, we leverage the theory of fitness landscapes to bridge microbial ecology and genetics, and show that this apparent simplicity is a generic consequence of structural and statistical constraints rather than evidence for intrinsically weak higher-order interactions (HOIs). We separate the detectability of individual epistatic interactions from their contribution to functional variance, and demonstrate that local k-order interactions suffer from exponential noise amplification while their contributions to total variance are intrinsically suppressed by combinatorial geometric dilution. Applying this framework to a fully sampled 2^10 experimental microbial landscape, we find that only first- and second-order interactions are distinguishable from experimental noise. Furthermore, generalized Lotka-Volterra simulations reveal that experimental noise alone can generate the illusion of higher-order structure in communities where all direct mechanistic interactions are pairwise and indirect interactions are weak. Our findings identify universal, order-dependent limits on the quantification of epistasis that apply to high-dimensional landscapes across ecology and genetics, providing a principled foundation for rational community design.
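One way to see where an exponential noise amplification can come from is the standard alternating-sum definition of a local interaction, sketched below. This is an illustration under stated assumptions and not necessarily the preprint's exact formulation: the symbols $F$, $K$, $B$, $\varepsilon_K$, and $\sigma^2$ are introduced here, and the measurement noise is assumed independent with equal variance across communities.

```latex
% Local k-th order interaction among the species in K (|K| = k),
% measured on a background community B, as an alternating sum of the
% community function F over all 2^k sub-combinations S of K:
\varepsilon_K(B) \;=\; \sum_{S \subseteq K} (-1)^{\,k-|S|}\, F(B \cup S)

% If each measured F carries independent noise of variance \sigma^2,
% the 2^k terms each contribute \sigma^2 regardless of sign, so
\operatorname{Var}\!\left[\hat{\varepsilon}_K(B)\right] \;=\; 2^{k}\,\sigma^{2}
```

Under these assumptions the standard error of an estimated interaction grows as $2^{k/2}\sigma$, so a third-order term is about twice as noisy as a first-order effect measured on the same data.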